Vectorizing Operations with Pandas: Speed Galore!

Often, when you have your data in a Pandas DataFrame or a Numpy ndarray, you want to perform some sort of operation on every cell in your data frame. You could loop over it, and sometimes you may have to. But you should always try to vectorize your operations since this takes advantage of the Pandas/Numpy system of (arrayed) vectors.

Let's jump right in.


In [2]:
import pandas as pd
import numpy as np
#First, we create an 8x8 array of random integers.
df = pd.DataFrame(data = np.random.randint(0, 100, size = (8,8)), index = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'))
print(df) #since we are using random numbers, your array will always be different.


    0   1   2   3   4   5   6   7
a  63  95  72  28  46   1  30  34
b  94  45  10  12  58  45  12  78
c  52   9  22  25  44  86  52  74
d  23  63  97  35  26  47  73  49
e  16   4  85  76  44  10  46  23
f  27  27  29  85  65  75  80  28
g  65  27  88  97  18  31  71  61
h  80  19  81  64  23  21  85  44

[8 rows x 8 columns]

Now, as mentioned in the introduction, if we wanted to multiply every cell by 2, we could loop through each row and then each column of the DataFrame, multiply everything by two, and then return the result. Something like this.


In [3]:
def times_2(df):
    val_dict = {}
    for index, values in df.iterrows():
        val_dict[index] = []
        for value in values:
            val_dict[index].append(value * 2)
    df2 = pd.DataFrame(val_dict).T # transposes the df since dictionary elements are read as columns instead of rows.
    df2.index = df.index
    df2.columns = df.columns
    return df2

df2 = times_2(df)
print(df2)


     0    1    2    3    4    5    6    7
a  126  190  144   56   92    2   60   68
b  188   90   20   24  116   90   24  156
c  104   18   44   50   88  172  104  148
d   46  126  194   70   52   94  146   98
e   32    8  170  152   88   20   92   46
f   54   54   58  170  130  150  160   56
g  130   54  176  194   36   62  142  122
h  160   38  162  128   46   42  170   88

[8 rows x 8 columns]

But this is not a very efficient way to perform this operation. Let's take a look at an easier and faster way to do this: vectorizing.


In [4]:
df3 = df * 2
print(df3)
print(df3 == df2) # element-wise comparison
print(df3.all() == df2.all()) # column-wise comparison
print(df3.all(axis = 1) == df2.all(axis = 1)) # row-wise comparison


     0    1    2    3    4    5    6    7
a  126  190  144   56   92    2   60   68
b  188   90   20   24  116   90   24  156
c  104   18   44   50   88  172  104  148
d   46  126  194   70   52   94  146   98
e   32    8  170  152   88   20   92   46
f   54   54   58  170  130  150  160   56
g  130   54  176  194   36   62  142  122
h  160   38  162  128   46   42  170   88

[8 rows x 8 columns]
      0     1     2     3     4     5     6     7
a  True  True  True  True  True  True  True  True
b  True  True  True  True  True  True  True  True
c  True  True  True  True  True  True  True  True
d  True  True  True  True  True  True  True  True
e  True  True  True  True  True  True  True  True
f  True  True  True  True  True  True  True  True
g  True  True  True  True  True  True  True  True
h  True  True  True  True  True  True  True  True

[8 rows x 8 columns]
0    True
1    True
2    True
3    True
4    True
5    True
6    True
7    True
dtype: bool
a    True
b    True
c    True
d    True
e    True
f    True
g    True
h    True
dtype: bool

Not only does the code take much less time to write (only one line), it also takes much less time to run. Check it out.


In [5]:
%timeit times_2(df)
%timeit df * 2


1000 loops, best of 3: 787 µs per loop
10000 loops, best of 3: 45.4 µs per loop

Look at the difference! On my computer, vectorizing the operation (the second one) is about 19x faster than the loop. This is why you should always try to vectorize your operations if you have a large data set.
You can also perform arithmetic operations between objects in a vectorized manner.


In [6]:
df + df3


Out[6]:
0 1 2 3 4 5 6 7
a 189 285 216 84 138 3 90 102
b 282 135 30 36 174 135 36 234
c 156 27 66 75 132 258 156 222
d 69 189 291 105 78 141 219 147
e 48 12 255 228 132 30 138 69
f 81 81 87 255 195 225 240 84
g 195 81 264 291 54 93 213 183
h 240 57 243 192 69 63 255 132

8 rows × 8 columns


In [7]:
df - df3


Out[7]:
0 1 2 3 4 5 6 7
a -63 -95 -72 -28 -46 -1 -30 -34
b -94 -45 -10 -12 -58 -45 -12 -78
c -52 -9 -22 -25 -44 -86 -52 -74
d -23 -63 -97 -35 -26 -47 -73 -49
e -16 -4 -85 -76 -44 -10 -46 -23
f -27 -27 -29 -85 -65 -75 -80 -28
g -65 -27 -88 -97 -18 -31 -71 -61
h -80 -19 -81 -64 -23 -21 -85 -44

8 rows × 8 columns


In [8]:
-df


Out[8]:
0 1 2 3 4 5 6 7
a -63 -95 -72 -28 -46 -1 -30 -34
b -94 -45 -10 -12 -58 -45 -12 -78
c -52 -9 -22 -25 -44 -86 -52 -74
d -23 -63 -97 -35 -26 -47 -73 -49
e -16 -4 -85 -76 -44 -10 -46 -23
f -27 -27 -29 -85 -65 -75 -80 -28
g -65 -27 -88 -97 -18 -31 -71 -61
h -80 -19 -81 -64 -23 -21 -85 -44

8 rows × 8 columns


In [9]:
df * df3


Out[9]:
0 1 2 3 4 5 6 7
a 7938 18050 10368 1568 4232 2 1800 2312
b 17672 4050 200 288 6728 4050 288 12168
c 5408 162 968 1250 3872 14792 5408 10952
d 1058 7938 18818 2450 1352 4418 10658 4802
e 512 32 14450 11552 3872 200 4232 1058
f 1458 1458 1682 14450 8450 11250 12800 1568
g 8450 1458 15488 18818 648 1922 10082 7442
h 12800 722 13122 8192 1058 882 14450 3872

8 rows × 8 columns


In [10]:
df / df3


Out[10]:
0 1 2 3 4 5 6 7
a 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
b 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
c 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
d 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
e 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
f 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
g 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
h 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5

8 rows × 8 columns

If we want to apply a function to each column of a DataFrame, we can simply call the method.
There is also the apply method that applies a function row-wise.


In [11]:
from timeit import timeit

print(df.mean(axis = 1)) #This applies column-wise.
print('Vectorized function takes %s seconds' % timeit('np.mean(df)', 'from __main__ import df, np', number = 1000))
print(df.apply(np.mean, axis = 1)) #This applies the function row-wise.
print('Apply method takes %s seconds' % timeit('df.apply(np.mean, axis = 1)', 'from __main__ import df, np', number = 1000))


a    46.125
b    44.250
c    45.500
d    51.625
e    38.000
f    52.000
g    57.250
h    52.125
dtype: float64
Vectorized function takes 0.07512190900160931 seconds
a    46.125
b    44.250
c    45.500
d    51.625
e    38.000
f    52.000
g    57.250
h    52.125
dtype: float64
Apply method takes 0.6022517460078234 seconds

But notice how much longer the apply method takes. So while the apply method can be useful for functions you implement yourself, you should first check whether there is a function in Pandas that does the same thing because it will almost certainly be faster.

Note: we will see this later in the week when we implement our own cosine similarity function.

You can also apply a Pandas Series (or even scalar values) to a DataFrame element-wise.


In [12]:
s = pd.Series(np.random.randint(0, 8, size = 8))
s


Out[12]:
0    2
1    0
2    7
3    2
4    2
5    2
6    4
7    4
dtype: int64

In [13]:
df


Out[13]:
0 1 2 3 4 5 6 7
a 63 95 72 28 46 1 30 34
b 94 45 10 12 58 45 12 78
c 52 9 22 25 44 86 52 74
d 23 63 97 35 26 47 73 49
e 16 4 85 76 44 10 46 23
f 27 27 29 85 65 75 80 28
g 65 27 88 97 18 31 71 61
h 80 19 81 64 23 21 85 44

8 rows × 8 columns


In [ ]:
df * s

In [ ]:
df.multiply(s) #this is slightly faster than the above

In [ ]:
np.multiply(df, s) #the numpy function is about 33% bit faster than either of the previous ones

In [ ]:
df * 5

And, finally, notice what happens if your Series (or DataFrame) is a different size than your original.


In [ ]:
s2 = pd.Series(np.random.randint(0, 8, size = 3))
s2

In [ ]:
df

In [ ]:
df * s2

Final Activity:

Below write a small script that performs the following operations on the two DataFrames:
1. Raises every cell in one DataFrame to the power of the corresponding cell in the other, e.g.,

$\text{df.ix[0,0]}^{\textrm{df2.ix[0,0]}}$

2. Takes the xth root of every cell in one DataFrame, where x is the number in the corresponding cell in the other DF, e.g.,

$\sqrt[\text{df2.ix[0,0]}]{\text{df.ix[0,0]}}$

3. Takes the square root of the cell-wise sums of the DFs, e.g.,

$\sqrt{\text{df.ix[0,0]} + \text{df2.ix[0,0]}}$

4. Takes the natural log of the result of the each cell in one DF raised to the power of the corresponding cell in the other, e.g.,

$\ln(\text{df.ix[0,0]}^{\text{df2.ix[0,0]}})$

Hint: If any of these take more than one line of code or you are using apply, you are not vectorizing. Try Again!


In [24]:
# Insert your code for number 1 here.
df**df2


Out[24]:
0 1 2 3 4 5 6 7
a -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 1 -9223372036854775808 -9223372036854775808
b -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808
c -9223372036854775808 150094635296999136 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808
d -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808
e -9223372036854775808 65536 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808
f -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808
g -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808
h -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808 -9223372036854775808

8 rows × 8 columns


In [39]:
# Insert your code for number 2 here.
np.float128(df)**(1/df2)


Out[39]:
0 1 2 3 4 5 6 7
a 2.817313 3.121986 2.912951 3.036589 3.583048 1.000000 3.107233 3.239612
b 3.113737 3.556893 3.162278 3.464102 2.759669 3.556893 3.464102 2.971828
c 3.732511 3.000000 2.802039 2.924018 3.530348 3.045262 3.732511 2.932972
d 2.843867 2.817313 3.138289 3.271066 2.962496 3.608826 2.923013 3.659306
e 4.000000 4.000000 3.036370 2.952592 3.530348 3.162278 3.583048 2.843867
f 3.000000 3.000000 3.072317 3.036370 2.839412 2.942831 2.990698 3.036589
g 2.839412 3.000000 3.062814 3.138289 4.242641 3.141381 2.902783 2.794682
h 2.990698 4.358899 3.000000 2.828427 2.843867 2.758924 3.036370 3.530348

8 rows × 8 columns


In [37]:
np.sqrt(df+df2)


Out[37]:
0 1 2 3 4 5 6 7
a 8.185353 9.949874 8.717798 5.567764 7.000000 1.000000 5.744563 6.082763
b 9.899495 6.928203 3.464102 3.741657 7.874008 6.928203 3.741657 9.055385
c 7.416198 3.316625 5.000000 5.291503 6.855655 9.486833 7.416198 8.831761
d 5.099020 8.185353 10.049876 6.164414 5.385165 7.071068 8.774964 7.211103
e 4.242641 2.236068 9.433981 8.944272 6.855655 3.464102 7.000000 5.099020
f 5.477226 5.477226 5.656854 9.433981 8.306624 8.888194 9.165151 5.567764
g 8.306624 5.477226 9.591663 10.049876 4.472136 5.830952 8.660254 8.062258
h 9.165151 4.582576 9.219544 8.246211 5.099020 4.898979 9.433981 6.855655

8 rows × 8 columns


In [38]:
# Insert your code for number 4 here.
np.log(df**df2)


Out[38]:
0 1 2 3 4 5 6 7
a 16.572539 18.215508 17.106664 9.996614 11.485924 0.000000 10.203592 10.579082
b 18.173179 11.419987 4.605170 4.969813 16.241772 11.419987 4.969813 17.426835
c 11.853731 4.394449 9.273127 9.656627 11.352569 17.817389 11.853731 17.216260
d 9.406483 16.572539 18.298844 10.666044 9.774290 11.550443 17.161838 11.675461
e 5.545177 1.386294 17.770605 17.322933 11.352569 4.605170 11.485924 9.406483
f 9.887511 9.887511 10.101887 17.770605 16.697549 17.269952 17.528107 9.996614
g 16.697549 9.887511 17.909347 18.298844 5.780744 10.301962 17.050720 16.443495
h 17.528107 5.888878 17.577797 16.635532 9.406483 9.133567 17.770605 11.352569

8 rows × 8 columns

Take a close look at your answers. If they don't make sense or there are problems with them, figure out why and correct them in your code.